Using Sketches to Estimate Associations
Abstract
We should not have to look at the entire corpus (e.g., the Web) to know whether two words are associated. A powerful sampling technique called Sketches was originally introduced to remove duplicate Web pages. We generalize sketches to estimate contingency tables and associations, using a maximum likelihood estimator to find the most likely contingency table given the sample, the margins (document frequencies), and the size of the collection. Not surprisingly, computational work and statistical accuracy (variance or errors) depend on the sampling rate, as will be shown both theoretically and empirically. Sampling methods become increasingly important as collections grow. At Web scale, sampling rates as low as 10^-4 may suffice.
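As a rough illustration of the estimation step described in the abstract, the sketch below picks the most likely co-occurrence count given a sample contingency table, the two document frequencies, and the collection size. It rests on assumptions not stated in the abstract: the sampled documents are treated as a uniform random sample drawn without replacement (a multivariate hypergeometric likelihood), and the maximum is found by brute-force search over all feasible values, not by whatever closed-form or approximate solution the paper derives. The function name `mle_cooccurrence` and its parameter names are hypothetical.

```python
from math import comb, log


def mle_cooccurrence(a_s, b_s, c_s, d_s, f1, f2, D):
    """Most likely number of documents containing both words.

    a_s, b_s, c_s, d_s: sample counts of documents containing both words,
        word 1 only, word 2 only, and neither word.
    f1, f2: document frequencies (margins) of the two words.
    D: total number of documents in the collection.

    Assumption: the sample is uniform and without replacement, so the
    likelihood of a candidate count `a` is proportional to
    C(a, a_s) * C(f1 - a, b_s) * C(f2 - a, c_s) * C(D - f1 - f2 + a, d_s).
    """
    # Every population cell must be at least as large as its sample cell.
    lo = max(a_s, d_s + f1 + f2 - D)
    hi = min(f1 - b_s, f2 - c_s)
    if lo > hi:
        raise ValueError("sample counts are inconsistent with the margins")

    best_a, best_ll = lo, float("-inf")
    for a in range(lo, hi + 1):
        # Work with log-likelihoods to keep the comparison numerically tame.
        ll = (
            log(comb(a, a_s))
            + log(comb(f1 - a, b_s))
            + log(comb(f2 - a, c_s))
            + log(comb(D - f1 - f2 + a, d_s))
        )
        if ll > best_ll:
            best_a, best_ll = a, ll
    return best_a


# Hypothetical numbers: a 10% sample of a 1000-document collection.
estimate = mle_cooccurrence(a_s=2, b_s=8, c_s=6, d_s=84, f1=100, f2=80, D=1000)
print(estimate)
```

The exhaustive loop is only practical for small margins; it is meant to make the role of the sample, the margins, and the collection size concrete, not to reproduce the paper's estimator.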
Similar resources
Using Sketches to Estimate Two-way and Multi-way Associations
We should not have to look at the entire corpus (e.g., the Web) to know if two (or more) words are associated or not. A powerful sampling technique called Sketches was originally introduced to remove duplicate Web pages. We generalize sketches to estimate contingency tables and associations, using a maximum likelihood estimator to find the most likely contingency table given the sample, the mar...
Using Hedonic Prices to Estimate Quality Changes concerning Iranian Automobile Market
This paper sketches a model of product differentiation according to the hedonic hypothesis that is based on the theory of consumer behavior of Lancaster (1971). Lancaster suggested that utility is derived from the characteristics of the good and not the good itself. Thus, from the perception of the consumer, every characteristic has a price. This is the hedonic (or implicit) price. We ...
Thinking with Sketches
Sketches serve to externalize ideas, to render fleeting ideas permanent, to confer coherence on scattered concepts, to turn internal thoughts public. They can be created and recreated, examined and reexamined, configured and reconfigured, considered and reconsidered, for clarity and for creativity. The schematic vocabulary of sketches allows both expression and discovery of ideas. Sketching is ...
A Sketch Algorithm for Estimating Two-Way and Multi-Way Associations
We should not have to look at the entire corpus (e.g., the Web) to know if two (or more) words are strongly associated or not. One can often obtain estimates of associations from a small sample. We develop a sketch-based algorithm that constructs a contingency table for a sample. One can estimate the contingency table for the entire population using straightforward scaling. However, one can do ...
New cardinality estimation algorithms for HyperLogLog sketches
This paper presents new methods to estimate the cardinalities of multisets recorded by HyperLogLog sketches. A theoretically motivated extension to the original estimator is presented that eliminates the bias for small and large cardinalities. Based on the maximum likelihood principle a second unbiased method is derived together with a robust and efficient numerical algorithm to calculate the e...
Publication date: 2005